1 Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

1.1 Loading Data

data(yrbss)  # Reading in the data
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
skimr::skim(yrbss)  # Allows us to check for NA values etc
Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 ▁▂▅▅▇
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 ▁▅▇▃▁
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 ▆▇▂▁▁
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 ▆▂▅▃▇
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 ▇▂▅▂▅

1.2 Exploratory Data Analysis

Employing the skim and glimpse functions above, we observe that in the data set there are 1004 missing values. These will be filtered out, and the data processed in the following lines:

new_yrbss <- drop_na(yrbss)  # Eliminate missing values in the data

weight_data <- new_yrbss %>% 
  select(weight) %>%  # Construct the summary statistics table
  summarise(mean_weight= mean(weight, na.rm= TRUE), 
            median_weight= median(weight, na.rm= TRUE), 
            std_weight= sd(weight,na.rm= TRUE),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_weight = std_weight/sqrt(count),
            margin_of_error = t_critical * se_weight,
            CI_weight_low = mean_weight - margin_of_error,
            CI_weight_high = mean_weight + margin_of_error)
  
print.data.frame(head(weight_data))  # Present the summary statistics
##   mean_weight median_weight std_weight count t_critical se_weight
## 1        68.2          65.8       16.9  8351       1.96     0.185
##   margin_of_error CI_weight_low CI_weight_high
## 1           0.364          67.8           68.6
ggplot(new_yrbss, aes(x=weight))+
  geom_density(alpha=0.2) +   
  theme_bw() +                
  labs (
    title = "Density Plot for Weight",
    y     = "Density")

When we use visualization and look at the graph we can tell that the data is positively skewed. We would know this from our summary statistics because our mean is greater than our median.

We can then continue to process the data to produce tibbles more conducive to plotting boxplots for school children weight vs activity level:

ggplot(new_yrbss, aes(y= weight, x= physically_active_7d))+
  facet_wrap(vars(physically_active_7d)) +
  geom_boxplot() +  
  theme_bw() +                
  labs (
    title = "Weight vs. Physical Activity",
    x= "Days of Activity") +
  NULL

school_weight_yrbss <- new_yrbss %>%  # Creating separate table for those
                                      # active 3+ days
  mutate(physical_3plus = ifelse(physically_active_7d >=3, "YES", "NO")) %>% 
  count(physical_3plus) %>% 
  pivot_wider(names_from = physical_3plus,
                values_from = n ) 

new_school_weight_yrbss <- school_weight_yrbss %>% 
  mutate(total = YES + NO ,
      yes_proportion = YES / total,
      no_proportion = 1 - yes_proportion)
   

print.data.frame (head(new_school_weight_yrbss))
##     NO  YES total yes_proportion no_proportion
## 1 2656 5695  8351          0.682         0.318

Using this data, we will create a 95% confidence interval for the mean value of the weights of the school children:

new_school_weight_yrbss %>% 
  mutate(se = sqrt(no_proportion * (1-no_proportion)/ total),  # Standard error
      lower= no_proportion - 1.96*se,
      upper= no_proportion + 1.96*se
  )
## # A tibble: 1 × 8
##      NO   YES total yes_proportion no_proportion      se lower upper
##   <int> <int> <int>          <dbl>         <dbl>   <dbl> <dbl> <dbl>
## 1  2656  5695  8351          0.682         0.318 0.00510 0.308 0.328

Through the boxplot, we observe there is a relationship between the two variables. If we focus on the box alone in the boxplot, we see that those who engaged in physical activities for 0-2 days range from about 55-75, while those who exercise for at least 3 days weigh about 65-78. This would mean that those who exercise less actually weigh less as well. However, this is not taking into account the whiskers in the graph. Though out original expectation was that those who exercised less would weigh more, the data presented can be understood as showing the relation between regular exercise and higher levels of muscle mass.

box_plot_data <- new_yrbss %>% 
  mutate(physical_3plus= ifelse(physically_active_7d >=3, "YES", "NO"))
# Creating a new column for the desired data

ggplot(box_plot_data, aes(y= weight, x= physical_3plus))+
  geom_boxplot() +  
  theme_bw() +                
  labs (
    title = "Weight vs. Physical Activity",
    x= "Days of Activity") +
  NULL

1.3 Confidence Interval

Boxplots show how the medians of the two distributions compare, but we will also compare the means of the distributions using either a confidence interval or a hypothesis test.

yes_no_weight_data <- box_plot_data %>% 
  group_by(physical_3plus) %>% 
  summarise(mean_weight = mean(weight), 
            median_weight = median(weight), 
            std_weight = sd(weight),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_weight = std_weight/sqrt(count),
            margin_of_error = t_critical * se_weight,
            CI_weight_low = mean_weight - margin_of_error,
            CI_weight_high = mean_weight + margin_of_error)

print.data.frame(head(yes_no_weight_data)) # Printing the summary statistics
##   physical_3plus mean_weight median_weight std_weight count t_critical
## 1             NO        67.1          62.6       18.0  2656       1.96
## 2            YES        68.7          65.8       16.4  5695       1.96
##   se_weight margin_of_error CI_weight_low CI_weight_high
## 1     0.349           0.684          66.5           67.8
## 2     0.218           0.426          68.3           69.1

1.4 Hypothesis test with formula

We can express our null hypothesis (H0) and our alternative (H1) as:

Null: mean_weight(YES) = mean_weight(NO) Alternative: mean_weight(YES) != mean_weight(NO)

t.test(weight ~ physical_3plus, data = box_plot_data)  # Uses R
## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = -4, df = 4781, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.334 -0.721
## sample estimates:
##  mean in group NO mean in group YES 
##              67.1              68.7

1.5 Hypothesis test with infer

Running the same test using a specialised package:

obs_diff <- box_plot_data %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("YES", "NO"))  # infer package

print.data.frame(head(obs_diff))
##   stat
## 1 1.53
null_dist <- box_plot_data %>%
  # specify variables
  specify(weight ~ physical_3plus) %>%
  
  # assume independence, i.e, there is no difference
  hypothesize(null = "independence") %>%
  
  # generate 1000 reps, of type "permute"
  generate(reps = 1000, type = "permute") %>%
  
  # calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("YES", "NO"))

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram() + 
  NULL

And expand on it using the following code:

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

2 IMDB ratings: Differences between directors

In this section, we will reproduce the above graph, and perform hypothesis tests on the mean ratings of the films of Steven Spielberg and Tim Burton. Our H0 and H1 are outlined below:

Null: mean_rating (Steven Spielberg) = mean_rating (Tim Burton) Alternative: mean_rating (Steven Spielberg) != mean_rating (Tim Burton) Test-statistic: 3 (>2, so reject Null, true ratings do differ)

movies <- read_csv(here::here("data", "movies.csv"))  # Load in data
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
updated_movie <- movies %>% 
  filter(director== "Steven Spielberg" | director== "Tim Burton" ) 
# Filtered the movies data set into a smaller, more focused dataframe
new_movie <- updated_movie %>% 
  group_by(director) %>% 
  summarise (mean_rating= mean(rating), 
            median_rating= median(rating), 
            std_rating= sd(rating),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_rating = std_rating/sqrt(count),
            margin_of_error = t_critical * se_rating,
            CI_rating_low = mean_rating - margin_of_error,
            CI_rating_high = mean_rating + margin_of_error) 

t.test(rating ~ director, data = updated_movie)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93
movies_dist <- updated_movie %>%
  specify(rating ~ director) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

obs_diff_movie <- updated_movie %>%
  specify(rating ~ director) %>%
  calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

movies_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff_movie, direction = "two-sided")

print.data.frame(head(new_movie))
##           director mean_rating median_rating std_rating count t_critical
## 1 Steven Spielberg        7.57           7.6      0.695    23       2.07
## 2       Tim Burton        6.93           7.0      0.749    16       2.13
##   se_rating margin_of_error CI_rating_low CI_rating_high
## 1     0.145           0.301          7.27           7.87
## 2     0.187           0.399          6.53           7.33
round_df <- function(x, digits) {
    numeric_columns <- sapply(x, mode) == 'numeric'
    x[numeric_columns] <-  round(x[numeric_columns], digits)
    x}
rounded_movie <- round_df(new_movie, 2)

#Reproduce the graph

ggplot(rounded_movie, aes(x= mean_rating, y= director, color= director)) +
  geom_point(size= 3) +
  geom_errorbar(aes(xmin= CI_rating_low, xmax= CI_rating_high, color= director),
                width=0.1, size=1) +
  geom_rect(aes(xmax= 7.33, xmin= 7.27, ymin=0, ymax= 3, color= "grey") , 
            alpha= 0.3) +
  theme_bw() +
  labs( title= "Do Spielberg and Burton have the same mean IMDB ratings?",
        subtitle= "95% confidence intervals overlap",
        x= "Mean IMDB rating",
        y=" Directors") +
  scale_color_manual(values = c("grey", "brown2", "turquoise3")) +
  scale_y_discrete(limits = c("Tim Burton","Steven Spielberg")) +
  theme(legend.position = "none") +
  geom_text_repel(aes(label= mean_rating), color="black", size=6) +
  geom_text_repel(aes(label= CI_rating_low), 
                  position = position_nudge(x = -0.35), color="black") +
  geom_text_repel(aes(label= CI_rating_high), 
                  position = position_nudge(x = 0.3), color="black")

Ultimately, we can reject the null hypothesis, and show that at a 95% confidence level, the mean rating for Steven Spielberg films is higher than that of Tim Burton

3 Omega Group plc- Pay Discrimination

At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated in the company, in the sense that the salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (of which 24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.

You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.

3.1 Loading the data

omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…

3.2 Relationship Salary - Gender ?

The data frame omega contains the salaries for the sample of 50 executives in the company. We will analyse the dataset and investigate to see if there is a significant difference between the salaries of the male and female executives.

Null: mean_salary(female) = mean_salary(male) Alternative: mean_salary(female) != mean_salary(male)

# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
##   gender   min    Q1 median    Q3   max  mean   sd  n missing
## 1 female 47033 60338  64618 70033 78800 64543 7567 26       0
## 2   male 54768 68331  74675 78568 84576 73239 7463 24       0
omega_updated <- omega %>% 
  group_by(gender) %>% 
  summarise(mean_salary= mean(salary),
            median_salary= median(salary),
            std_salary= sd(salary),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_salary = std_salary/sqrt(count),
            margin_of_error = t_critical * se_salary,
            CI_salary_low = mean_salary - margin_of_error,
            CI_salary_high = mean_salary + margin_of_error)
print.data.frame(head(omega_updated))            
##   gender mean_salary median_salary std_salary count t_critical se_salary
## 1 female       64543         64618       7567    26       2.06      1484
## 2   male       73239         74675       7463    24       2.07      1523
##   margin_of_error CI_salary_low CI_salary_high
## 1            3056         61486          67599
## 2            3151         70088          76390

We can conclude that men are paid more than women at Omega Group. Confidence intervals for men and women do not overlap. This means the true means for men and women cannot be same.

Even though this differences could be due to factors other than gender, there is still reason for concern. The issue should be analysed further.

# hypothesis testing using t.test() 
t.test(salary ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239
# hypothesis testing using infer package
omega_dist <- omega %>%
  specify(salary ~ gender) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("female", "male"))

obs_diff_gender <- omega %>%
  specify(salary ~ gender) %>%
  calculate(stat = "diff in means", order = c("female", "male"))

omega_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff_gender, direction = "two-sided")

3.3 Relationship Experience - Gender?

Some may raise the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire send out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women only have about 7 years experience on average (see table below).

# Summary Statistics of salary by gender
favstats (experience ~ gender, data=omega)
##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0

Null: mean_experience (male) = mean_experience (female) Alternative: mean_experience (male) != mean_experience (female)

We coded to derive the t-value and p-value with the codes and results provided below:

experience_gender <- omega %>% 
  group_by(gender) %>% 
  summarise(mean_experience= mean(experience),
            median_experience= median(experience),
            std_experience= sd(experience),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_experience = std_experience/sqrt(count),
            margin_of_error = t_critical * se_experience,
            CI_experience_low = mean_experience - margin_of_error,
            CI_experience_high = mean_experience + margin_of_error)

print.data.frame(head(experience_gender))
##   gender mean_experience median_experience std_experience count t_critical
## 1 female            7.38               3.0           8.51    26       2.06
## 2   male           21.12              19.5          10.92    24       2.07
##   se_experience margin_of_error CI_experience_low CI_experience_high
## 1          1.67            3.44              3.95               10.8
## 2          2.23            4.61             16.52               25.7
# hypothesis testing using t.test() 
t.test(experience ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12
# hypothesis testing using infer package
omega_dist_experience <- omega %>%
  specify(experience ~ gender) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("female", "male"))

obs_diff_experience <- omega %>%
  specify(experience ~ gender) %>%
  calculate(stat = "diff in means", order = c("female", "male"))

omega_dist_experience %>% visualize() +
  shade_p_value(obs_stat = obs_diff_experience, direction = "two-sided")

Since the t value is -5 and p-value < 0.05, we reject the null. So, yes, there is a difference between the average experience of the male and female executives.

3.4 Relationship Salary - Experience ?

If someone at the meeting argue that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.

ggplot(omega, aes(x= experience, y=salary)) +
  geom_point() +
  geom_smooth() +
  theme_bw() +
  labs(title= "Salary vs. Experience",
       x="Experience",
       y= "Salary") +
  NULL

experience_salary <- omega %>% 
  group_by(experience) %>% 
  summarise(mean_salary= mean(salary),  # Summary statistics
            median_salary= median(salary),
            std_salary= sd(salary),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_salary = std_salary/sqrt(count),
            margin_of_error = t_critical * se_salary,
            CI_experience_low = mean_salary - margin_of_error,
            CI_experience_high = mean_salary + margin_of_error)

print.data.frame(head(experience_salary))
##   experience mean_salary median_salary std_salary count t_critical se_salary
## 1          0       55461         55490       5111     7       2.45      1932
## 2          1       63174         63174        511     2      12.71       361
## 3          2       62839         63948       2783     4       3.18      1392
## 4          3       61816         63948       6261     3       4.30      3615
## 5          4       66114         66114         NA     1        NaN        NA
## 6          5       62194         62194         NA     1        NaN        NA
##   margin_of_error CI_experience_low CI_experience_high
## 1            4727             50734              60187
## 2            4587             58587              67761
## 3            4429             58410              67268
## 4           15552             46264              77368
## 5             NaN               NaN                NaN
## 6             NaN               NaN                NaN

It can be seen that there is indeed a positive correlation between salary and experience. However, the significance of this statement faces several challenges and requires further analysis.

Firstly, as can be seen in the scatter plot, the 95% confidence interval is painted in grey. Many of the plots in the graph lay outside such intervals, which challenges the validity of the result in the 95% confidence interval. A more in-depth hypothesis test is expected to test the significance of this result.

Secondly, the graph shows that, when experience is higher than around 30, there may be a negative relationship between the two parameters. This can be reasonably explained with retirement or age (high age may be seen as a negative factor when deciding the salary in certain industries). To which extent this relationship validates should also be taken into consideration.

Thirdly, even if the positive relationship stands and survives from the two challenges above, it is unclear if salary increase along with experience, or experience increase with salary. It makes sense that more experienced workers can earn higher salaries due to higher work efficiency or productivity, but it also makes sense that people tend to stay longer in the industry if the salary is high enough.

Therefore, we can generally conclude that the graph above shows a positive correlation between salary and experience in general, but the validity of this statement/relationship required more in-depth analysis.

3.5 Check correlations between the data

We can use GGally:ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order our variables will appear in and have the dependent variable (Y), salary, as last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make ths plots somewhat transparent (alpha = 0.3).

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear 
  #  in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

The scatterplot shows that there is a somewhat positive relationship between salary and experience for both males and females. However, this positive relationship between both variables is more obvious for males.

On the contrary, for females, the positive relationship does not appear to be as strong. There are a few datapoints where a couple of females have 0 work experience, but their salaries are rather varied. The salary range is rather large, being nearly 60,000. This may suggest that factors other than work experience contribute to the determination of salary for females, especially when they have no experience at all.

4 Challenge 1: Yield Curve inversion

First, we will load the yield curve data file that contains data on the yield curve since 1960-01-01

yield_curve <- read_csv(here::here("data", "yield_curve.csv"))

glimpse(yield_curve)  # Examining loaded data for NAs etc
## Rows: 6,884
## Columns: 5
## $ date      <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-05-01,…
## $ series_id <chr> "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS…
## $ value     <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, 2.30, …
## $ maturity  <chr> "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", …
## $ duration  <chr> "3-Month Treasury Bill", "3-Month Treasury Bill", "3-Month T…

Our dataframe yield_curve has five columns (variables):

  • date: already a date object
  • series_id: the FRED database ticker symbol
  • value: the actual yield on that date
  • maturity: a short hand for the maturity of the bond
  • duration: the duration, written out in all its glory!

4.1 Plotting the yield curve

4.1.1 Yields on US rates by duration since 1960

new_yield_curve <- yield_curve
new_yield_curve$duration <- factor(new_yield_curve$duration, 
                  levels= c("3-Month Treasury Bill", "6-Month Treasury Bill",
                            "1-Year Treasury Rate", "2-Year Treasury Rate",
                            "3-Year Treasury Rate", "5-Year Treasury Rate",
                            "7-Year Treasury Rate", "10-Year Treasury Rate",
                            "20-Year Treasury Rate", "30-Year Treasury Rate") )

ggplot(new_yield_curve, aes(x = date, y = value, color = duration)) +
  geom_line() +
  facet_wrap(vars(duration), ncol=2) +
  theme_bw()+
  theme(legend.position = "none") +
  labs(title= "Yields on U.S. Treasury rates since 1960",
       y= "%")

4.1.2 Monthly yields on US rates by duration since 1999 on a year-by-year basis

library(lubridate)

monthly_yield  <- new_yield_curve %>% 
  mutate(year= year(date),
         month= month(date)) %>% 
  filter(year>=1999) %>% 
  group_by(year, month)

print.data.frame(head(monthly_yield))
##         date series_id value maturity              duration year month
## 1 1999-01-01     TB3MS  4.34       3m 3-Month Treasury Bill 1999     1
## 2 1999-02-01     TB3MS  4.44       3m 3-Month Treasury Bill 1999     2
## 3 1999-03-01     TB3MS  4.44       3m 3-Month Treasury Bill 1999     3
## 4 1999-04-01     TB3MS  4.29       3m 3-Month Treasury Bill 1999     4
## 5 1999-05-01     TB3MS  4.50       3m 3-Month Treasury Bill 1999     5
## 6 1999-06-01     TB3MS  4.57       3m 3-Month Treasury Bill 1999     6
ggplot(monthly_yield, aes(x= maturity, y= value, group = month, color=year)) +
  geom_line() +
  facet_wrap(vars(year), ncol= 4)+
  theme(legend.position = "none") +
  scale_x_discrete(limits = c("3m","6m", "1y","2y", "3y", "5y", "7y", "10y",
                              "20y", "30sy")) +
  labs(title= "U.S. Yield Curve",
       y= "Yield (%)") + 
  NULL

4.1.3 3-month and 10-year yields since 1999

three_ten_yield <- monthly_yield %>% 
  filter(duration == "3-Month Treasury Bill" | 
           duration == "10-Year Treasury Rate")

print.data.frame(head(three_ten_yield))
##         date series_id value maturity              duration year month
## 1 1999-01-01     TB3MS  4.34       3m 3-Month Treasury Bill 1999     1
## 2 1999-02-01     TB3MS  4.44       3m 3-Month Treasury Bill 1999     2
## 3 1999-03-01     TB3MS  4.44       3m 3-Month Treasury Bill 1999     3
## 4 1999-04-01     TB3MS  4.29       3m 3-Month Treasury Bill 1999     4
## 5 1999-05-01     TB3MS  4.50       3m 3-Month Treasury Bill 1999     5
## 6 1999-06-01     TB3MS  4.57       3m 3-Month Treasury Bill 1999     6
ggplot(three_ten_yield, aes(x=date, y= value, color=duration))+
  geom_line()+
  theme_bw() +
  labs(title= "Yields on 3-month and 10-year US Treasury rates since 1999",
       y= "%") + 
  NULL

According to Wikipedia’s list of recession in the United States, since 1999 there have been two recession in the US: between Mar 2001–Nov 2001 and between Dec 2007–June 2009. However, the plots here seem to show that the drastic changes in yields occure after the given start date of the recession, and also that there are many other periods of the yield curve flattening which are not attached to any recession.

# get US recession dates after 1946 from Wikipedia 
# https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States

recessions <- tibble(
  from = c("1948-11-01", "1953-07-01", "1957-08-01", "1960-04-01", 
           "1969-12-01", "1973-11-01", "1980-01-01","1981-07-01", "1990-07-01", 
           "2001-03-01", "2007-12-01","2020-02-01"),  
  to = c("1949-10-01", "1954-05-01", "1958-04-01", "1961-02-01", "1970-11-01", 
         "1975-03-01", "1980-07-01", "1982-11-01", "1991-03-01", "2001-11-01",
         "2009-06-01", "2020-04-30") 
  )  %>% 
  mutate(From = ymd(from), 
         To=ymd(to),
         duration_days = To-From)

print.data.frame(head(recessions))
##         from         to       From         To duration_days
## 1 1948-11-01 1949-10-01 1948-11-01 1949-10-01      334 days
## 2 1953-07-01 1954-05-01 1953-07-01 1954-05-01      304 days
## 3 1957-08-01 1958-04-01 1957-08-01 1958-04-01      243 days
## 4 1960-04-01 1961-02-01 1960-04-01 1961-02-01      306 days
## 5 1969-12-01 1970-11-01 1969-12-01 1970-11-01      335 days
## 6 1973-11-01 1975-03-01 1973-11-01 1975-03-01      485 days
new_three_ten_yield <- yield_curve %>% 
  filter(duration == "3-Month Treasury Bill" | 
           duration == "10-Year Treasury Rate")

updated_three_ten_yield <- new_three_ten_yield %>% 
  select(date, duration, value) %>% 
  pivot_wider(names_from=duration,
               values_from=value) %>% 
  janitor::clean_names() %>% 
  mutate(difference= x10_year_treasury_rate - x3_month_treasury_bill)

updated_recessions <- recessions %>% 
  filter(year(From) >=1960)

#merged dataframe (updated_recessions & updated_three_ten_yield)
new_data <- merge(updated_three_ten_yield, updated_recessions)


ggplot(new_data, aes(x = date, y = difference))+
  geom_line() +
  geom_line(aes(y=0), color= "grey3") +
  
  geom_rect(aes(xmin= From, xmax=To, ymin=-3, ymax=5), fill="gray84") +
  
  geom_ribbon(aes(ymin=pmin(difference,0), ymax=0), fill="indianred3", 
              alpha=0.5) +
  geom_ribbon(aes(ymin=0, ymax=pmax(difference,0)), fill="skyblue3", 
              alpha=0.5) +
  
  geom_rug(data = subset(updated_three_ten_yield, difference < 0 ), 
           color="indianred3", sides="b") +
  geom_rug(data = subset(updated_three_ten_yield, difference >= 0 ), 
           color="skyblue3", sides="b") +

  theme(legend.position="none") +
  theme_minimal() +
  labs (
    title = "Yield Curve Inversion: 10-year minus 3-month U.S.Treasury rates",
    x = "", 
    y = "Difference (10 year - 3 month) yield in %",
  ) 

The above plot shows that recessions can also precede inversion, and so there may not be as strong a link as suggested.

5 Challenge 2: GDP components over time and among countries

At the risk of oversimplifying things, the main components of gross domestic product, GDP are personal consumption (C), business investment (I), government spending (G) and net exports (exports - imports).

The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries from 1970 to today. We will look at how GDP and its components have changed over time, and compare different countries and how much each component contributes to that country’s GDP.

UN_GDP_data  <-  read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
                sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip


print.data.frame(head(UN_GDP_data))
##   CountryID     Country
## 1         4 Afghanistan
## 2         4 Afghanistan
## 3         4 Afghanistan
## 4         4 Afghanistan
## 5         4 Afghanistan
## 6         4 Afghanistan
##                                                                              IndicatorName
## 1                                                            Final consumption expenditure
## 2 Household consumption expenditure (including Non-profit institutions serving households)
## 3                                         General government final consumption expenditure
## 4                                                                  Gross capital formation
## 5       Gross fixed capital formation (including Acquisitions less disposals of valuables)
## 6                                                            Exports of goods and services
##       1970     1971     1972     1973     1974     1975     1976     1977
## 1 5.56e+09 5.33e+09 5.20e+09 5.75e+09 6.15e+09 6.32e+09 6.37e+09 6.90e+09
## 2 5.07e+09 4.84e+09 4.70e+09 5.21e+09 5.59e+09 5.65e+09 5.68e+09 6.15e+09
## 3 3.72e+08 3.82e+08 4.02e+08 4.21e+08 4.31e+08 5.98e+08 6.27e+08 6.76e+08
## 4 9.85e+08 1.05e+09 9.19e+08 9.19e+08 1.18e+09 1.37e+09 2.03e+09 1.92e+09
## 5 9.85e+08 1.05e+09 9.19e+08 9.19e+08 1.18e+09 1.37e+09 2.03e+09 1.92e+09
## 6 1.12e+08 1.45e+08 1.73e+08 2.18e+08 3.00e+08 3.16e+08 4.17e+08 4.31e+08
##       1978     1979     1980     1981     1982     1983     1984     1985
## 1 7.09e+09 6.92e+09 6.69e+09 6.81e+09 6.96e+09 7.30e+09 7.43e+09 7.45e+09
## 2 6.30e+09 6.15e+09 5.95e+09 6.06e+09 6.19e+09 6.49e+09 6.61e+09 6.63e+09
## 3 7.25e+08 7.08e+08 6.85e+08 6.97e+08 7.12e+08 7.47e+08 7.60e+08 7.63e+08
## 4 2.22e+09 2.07e+09 1.98e+09 2.06e+09 2.08e+09 2.19e+09 2.23e+09 2.23e+09
## 5 2.22e+09 2.07e+09 1.98e+09 2.06e+09 2.08e+09 2.19e+09 2.23e+09 2.23e+09
## 6 4.57e+08 4.89e+08 4.53e+08 4.60e+08 4.77e+08 4.96e+08 5.06e+08 5.08e+08
##       1986     1987     1988     1989     1990     1991     1992     1993
## 1 7.68e+09 6.89e+09 6.32e+09 5.87e+09 5.69e+09 5.28e+09 5.59e+09 4.36e+09
## 2 6.83e+09 6.12e+09 5.62e+09 5.22e+09 5.06e+09 4.70e+09 4.97e+09 3.87e+09
## 3 7.85e+08 7.05e+08 6.46e+08 6.01e+08 5.82e+08 5.40e+08 5.72e+08 4.46e+08
## 4 2.30e+09 2.07e+09 1.90e+09 1.76e+09 1.71e+09 1.51e+09 1.52e+09 1.13e+09
## 5 2.30e+09 2.07e+09 1.90e+09 1.76e+09 1.71e+09 1.51e+09 1.52e+09 1.13e+09
## 6 5.23e+08 4.69e+08 4.31e+08 4.00e+08 3.88e+08 4.15e+08 4.92e+08 4.22e+08
##       1994     1995     1996     1997     1998     1999     2000     2001
## 1 3.52e+09 5.46e+09 5.36e+09 5.25e+09 5.18e+09 5.09e+09 4.95e+09 4.70e+09
## 2 3.13e+09 4.86e+09 4.76e+09 4.67e+09 4.60e+09 4.52e+09 4.41e+09 4.17e+09
## 3 3.59e+08 5.60e+08 5.48e+08 5.36e+08 5.33e+08 5.17e+08 5.04e+08 4.95e+08
## 4 8.70e+08 1.29e+09 1.21e+09 1.14e+09 1.08e+09 1.02e+09 9.53e+08 1.00e+09
## 5 8.70e+08 1.29e+09 1.21e+09 1.14e+09 1.08e+09 1.02e+09 9.53e+08 1.00e+09
## 6 3.69e+08 6.16e+08 6.42e+08 6.64e+08 6.87e+08 7.04e+08 7.13e+08 6.54e+08
##       2002     2003     2004     2005     2006     2007     2008     2009
## 1 7.18e+09 8.87e+09 8.73e+09 1.01e+10 1.07e+10 1.20e+10 1.23e+10 1.29e+10
## 2 6.40e+09 7.89e+09 7.66e+09 9.00e+09 9.34e+09 1.04e+10 1.06e+10 1.10e+10
## 3 7.02e+08 9.06e+08 1.05e+09 1.06e+09 1.40e+09 1.71e+09 1.73e+09 2.15e+09
## 4 1.37e+09 1.54e+09 1.90e+09 2.06e+09 2.06e+09 2.17e+09 2.14e+09 3.12e+09
## 5 1.37e+09 1.54e+09 1.90e+09 2.06e+09 2.06e+09 2.17e+09 2.14e+09 3.12e+09
## 6 9.49e+08 1.41e+09 1.11e+09 1.14e+09 1.09e+09 1.03e+09 1.24e+09 1.53e+09
##       2010     2011     2012     2013     2014     2015     2016     2017
## 1 1.79e+10 1.97e+10 2.91e+10 3.48e+10 3.35e+10 3.53e+10 3.50e+10 3.51e+10
## 2 1.57e+10 1.70e+10 2.59e+10 3.14e+10 3.02e+10 3.19e+10 3.16e+10 3.16e+10
## 3 2.25e+09 2.69e+09 2.81e+09 2.81e+09 2.76e+09 2.81e+09 2.84e+09 2.85e+09
## 4 2.81e+09 2.61e+09 2.85e+09 2.75e+09 2.13e+09 2.29e+09 2.34e+09 2.24e+09
## 5 2.81e+09 2.61e+09 2.85e+09 2.75e+09 2.13e+09 2.29e+09 2.34e+09 2.24e+09
## 6 1.58e+09 1.72e+09 1.31e+09 8.34e+08 1.20e+09 9.10e+08 7.54e+08 7.60e+08

The first thing we need to do is to tidy the data, as it is in wide format and so we must make it into long, tidy format, and express all figures in billions (divide values by 1e9, or \(10^9\)), and you want to rename the indicators into something shorter.

ncol(UN_GDP_data) 
## [1] 51
tidy_GDP_data  <- UN_GDP_data %>%  
  pivot_longer(cols = 4:51, #columns 4 to 51
               names_to = "Year",
               values_to = "Country_GDP") %>% 
  mutate(Country_GDP = Country_GDP/1e9)

print.data.frame(head(tidy_GDP_data))
##   CountryID     Country                 IndicatorName Year Country_GDP
## 1         4 Afghanistan Final consumption expenditure 1970        5.56
## 2         4 Afghanistan Final consumption expenditure 1971        5.33
## 3         4 Afghanistan Final consumption expenditure 1972        5.20
## 4         4 Afghanistan Final consumption expenditure 1973        5.75
## 5         4 Afghanistan Final consumption expenditure 1974        6.15
## 6         4 Afghanistan Final consumption expenditure 1975        6.32
country_list <- c("United States","India", "Germany") #Compare GDP components for these 3 countries


unique(tidy_GDP_data[c("IndicatorName")])
## # A tibble: 17 × 1
##    IndicatorName                                                                
##    <chr>                                                                        
##  1 Final consumption expenditure                                                
##  2 Household consumption expenditure (including Non-profit institutions serving…
##  3 General government final consumption expenditure                             
##  4 Gross capital formation                                                      
##  5 Gross fixed capital formation (including Acquisitions less disposals of valu…
##  6 Exports of goods and services                                                
##  7 Imports of goods and services                                                
##  8 Gross Domestic Product (GDP)                                                 
##  9 Agriculture, hunting, forestry, fishing (ISIC A-B)                           
## 10 Mining, Manufacturing, Utilities (ISIC C-E)                                  
## 11 Manufacturing (ISIC D)                                                       
## 12 Construction (ISIC F)                                                        
## 13 Wholesale, retail trade, restaurants and hotels (ISIC G-H)                   
## 14 Transport, storage and communication (ISIC I)                                
## 15 Other Activities (ISIC J-P)                                                  
## 16 Total Value Added                                                            
## 17 Changes in inventories

First, we will reproduce this plot:

raw_GDP_component_list <- c("Gross capital formation", "Exports of goods and services", "General government final consumption expenditure", "Household consumption expenditure (including Non-profit institutions serving households)", "Imports of goods and services", "Gross Domestic Product (GDP)")

GDP_component_list <- c("Gross capital formation", "Exports", "Government expenditure", "Household expenditure","Imports","GDP")

three_countries_data <- tidy_GDP_data %>% 
  filter(Country %in% country_list) %>% 
  mutate(IndicatorName = factor(IndicatorName,
                                levels = raw_GDP_component_list,
                                labels = GDP_component_list)) %>% 
    filter(IndicatorName %in% GDP_component_list) 

three_countries_data_no_GDP <- three_countries_data %>% 
  filter(IndicatorName != "GDP")

#head(plot_three_countries)

plot_three_countries <- ggplot(three_countries_data_no_GDP, aes(x = Year, 
                                 y = Country_GDP, 
                                 group = IndicatorName, 
                                 colour = IndicatorName))+
  scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
  geom_line(size = 1) +
  facet_wrap(vars(Country), scales = "free_x") +
  theme_bw()+
  labs (
    title = "GDP components over time",
    subtitle = "In constant 2010 USD",
    x = "",
    y = "Billion US$" ) + 
  guides(color=guide_legend(title="Components of GDP")) +
  NULL

plot_three_countries

GDP_component_cleaned_list <- c("Gross_capital_formation", "Exports", "Government_expenditure", "Household_expenditure","Imports","GDP")


original_GDP_data <- three_countries_data %>% 
    mutate(IndicatorName = factor(IndicatorName,
                                levels = GDP_component_list,
                                labels = GDP_component_cleaned_list)) %>%
  pivot_wider(names_from = IndicatorName,
              values_from = Country_GDP) %>% 
  mutate(GDP_calculated = Household_expenditure + Government_expenditure + Gross_capital_formation + Exports - Imports,
         #GDP_proportion = abs(GDP_calculated - GDP)/GDP,
         Household_expenditure_proportion = Household_expenditure / GDP,
         Government_expenditure_proportion = Government_expenditure / GDP,
         Gross_capital_formation_proportion = Gross_capital_formation / GDP,
         Net_Exports_proportion = (Exports - Imports) / GDP)


print.data.frame(head(original_GDP_data))
##   CountryID Country Year Household_expenditure Government_expenditure
## 1       276 Germany 1970                   872                    280
## 2       276 Germany 1971                   919                    298
## 3       276 Germany 1972                   969                    313
## 4       276 Germany 1973                   997                    332
## 5       276 Germany 1974                   995                    351
## 6       276 Germany 1975                  1032                    366
##   Gross_capital_formation Exports Imports  GDP GDP_calculated
## 1                     442     174     188 1534           1581
## 2                     444     178     202 1582           1638
## 3                     452     189     215 1650           1709
## 4                     464     210     222 1729           1779
## 5                     419     234     223 1744           1775
## 6                     392     220     230 1729           1780
##   Household_expenditure_proportion Government_expenditure_proportion
## 1                            0.568                             0.183
## 2                            0.581                             0.188
## 3                            0.587                             0.190
## 4                            0.576                             0.192
## 5                            0.570                             0.201
## 6                            0.597                             0.212
##   Gross_capital_formation_proportion Net_Exports_proportion
## 1                              0.288               -0.00883
## 2                              0.280               -0.01471
## 3                              0.274               -0.01534
## 4                              0.268               -0.00743
## 5                              0.240                0.00623
## 6                              0.226               -0.00580

Secondly, recall that GDP is the sum of Household Expenditure (Consumption C), Gross Capital Formation (business investment I), Government Expenditure (G) and Net Exports (exports - imports). Even though there is an indicator Gross Domestic Product (GDP) in your dataframe, I would like you to calculate it given its components discussed above.

What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe?

#Find difference between actual and calculated GDP
diff_GDP_data <- three_countries_data %>% 
    mutate(IndicatorName = factor(IndicatorName,
                                levels = GDP_component_list,
                                labels = GDP_component_cleaned_list)) %>%
  pivot_wider(names_from = IndicatorName,
              values_from = Country_GDP) %>% 
  mutate(GDP_calculated = Household_expenditure + Government_expenditure + Gross_capital_formation + Exports - Imports,
         GDP_diff_proportion = abs(GDP_calculated - GDP)/GDP)

plot_diff<- ggplot(diff_GDP_data, aes(x = Year, y = GDP_diff_proportion, group = Country, colour = Country)) +
  scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  geom_line(size = 0.9) +
  theme_bw()+
  labs (
    title = "% difference between what calculated as GDP and the GDP figure",
    x = "",
    y = "proportion" ) + 
  guides(color=guide_legend(title="Country")) +
  NULL

  
plot_diff

India has big difference, others not, having had 2% difference in latest year, maybe as bussiness investment (large part of India’s GDP) can be reported very differently to other countries listed, due to various differences in legal frameworks and reporting regulations.

Next, we are reproducing the above plot using the following code:

#Component as a proportion of original GDP (replicating the graph)

plot_GDP_percent_diff <- original_GDP_data %>% 
  pivot_longer(cols = 11:14, #columns 11 to 15
               names_to = "Component",
               values_to = "Proportion") 

ggplot(plot_GDP_percent_diff, aes(x = Year, 
                                 y = Proportion, 
                                 group = Component, 
                                 colour = Component))+
  scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  geom_line(size = 0.9) +
  facet_wrap(vars(Country), scales = "free_x") +
  theme_bw()+
  labs (
    title = "GDP components over time",
    x = "",
    y = "proportion") + 
  guides(color=guide_legend(title="")) +
  NULL

#Component as a proportion of calculated GDP

calculated_GDP_data <- three_countries_data %>% 
    mutate(IndicatorName = factor(IndicatorName,
                                levels = GDP_component_list,
                                labels = GDP_component_cleaned_list)) %>%
  pivot_wider(names_from = IndicatorName,
              values_from = Country_GDP) %>% 
  mutate(GDP_calculated = Household_expenditure + Government_expenditure + Gross_capital_formation + Exports - Imports,
         Household_expenditure_proportion = Household_expenditure / GDP_calculated,
         Government_expenditure_proportion = Government_expenditure / GDP_calculated,
         Gross_capital_formation_proportion = Gross_capital_formation / GDP_calculated,
         Net_Exports_proportion = (Exports - Imports) / GDP_calculated)

plot_calculated_GDP <- calculated_GDP_data %>% 
  pivot_longer(cols = 11:14, #columns 11 to 15
               names_to = "Component",
               values_to = "Proportion") 

ggplot(plot_calculated_GDP, aes(x = Year, 
                                 y = Proportion, 
                                 group = Component, 
                                 colour = Component))+
  scale_x_discrete(breaks = seq(1970, 2020, by = 10)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  geom_line(size = 0.9) +
  facet_wrap(vars(Country), scales = "free_x") +
  theme_bw()+
  labs (
    title = "GDP and its breakdown at constant 2010 prices in US Dollars",
    x = "",
    y = "proportion" ) + 
  guides(color=guide_legend(title="")) +
  NULL

> What is this last chart telling you? Can you explain in a couple of paragraphs the different dynamic among these three countries?

India shows a large increase in investments (related to the post-1991 economic growth, particularly in the tech sector, with foreign investment there having grown massively), also showing household expenditure proportion dropping (development related, as more households have more disposable income), but still has a lower government expenditure)

The US is experiencing net exports dropping, partly due to cheaper manufacturing elsewhere, marking its transition to a service economy, and also drop in government expenditure, perhaps marking the turn towards austerity post-recessions.However, it still shows stability in gross capital formation.

Germany is showing a rise in net exports, odd for a Western nation, and is table on other fronts.

6 Details

  • Who did you collaborate with: N/A
  • Approximately how much time did you spend on this problem set: 2.5 hours
  • What, if anything, gave you the most trouble: Nothing specific

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

Yes!